Author

Lisa Levoir

Published

November 8, 2023

1 Data import and cleaning

In this section I import the data and prepare it for analysis.

Code
```{r setup}
#| warning: false
#| message: false

library(tidyverse)
library(plotly)
library(dplyr)
library(tidyr)
library(knitr)
library(table1) #Create HTML Tables of Descriptive Statistics https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html
#library(OMTM1) #https://github.com/schildjs/OMTM1/
library(Hmisc)
library(rms) # Regression Modeling Strategies by Frank https://cran.r-project.org/web/packages/rms/index.html
library(modelsummary) #Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready https://cran.r-project.org/web/packages/modelsummary/index.html
library(scales) # The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. https://scales.r-lib.org
library(viridis) #colors
library(cowplot) #allows me to use plotgrid
library(gridExtra) #adding tables to plots
library(visdat) #shows missing data
library(GGally) #makes pairs plots
library(sandwich) #for robust standard errors
library(gtsummary) #makes the tables look nicer

setwd("/Users/lisalevoir/BIOS7351_Collab/data_project2") #this line I would need to run in the console
knitr::opts_knit$set(root.dir = "/Users/lisalevoir/BIOS7351_Collab/github/BIOS_Collaboration/project_2_analysis") #now I set global options for knitting, I also had to toggle global options > R Markdown > evaluate chunks in current directory


#import the data
dat <- read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/combined_data_203.csv")

#to compare that the merging went as expected
VUMSdat <- read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_VUMS.csv")
HMSdat <- read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_HMS.csv")
UVAdat <- read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_UVA.csv")
```

1.0.1 Inclusion/Exclusion

Criteria to exclude students who most likely took a scored exam:

  • Any PhD students (n = 2)

  • Any 5th year program students (n = 19)

  • M4 students at Vanderbilt (n = 5)

  • Students who did not complete either Step survey (n = 2)

  • Students who specifically stated they took a scored Step 1 (n=1)

Based on our criteria we would exclude record IDs:

  • VUSM: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64

  • HMS: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61

  • UVA: 33, 47, 80, 81, 83

[1] "uworld_percent_step1_1" "uworld_percent_step1_2"
[1] "amboss_percent_step1_1" "amboss_percent_step1_2"
[1] "length_step1_1" "length_step1_2"
[1] "practicetest_step1_1" "practicetest_step1_2"
[1] "full_test_practice_step1_1" "full_test_practice_step1_2"
[1] "push_step1_1" "push_step1_2"
[1] "push_practice_test_step1_1" "push_practice_test_step1_2"
[1] "push_nbme_practice_score_step1_1" "push_nbme_practice_score_step1_2"
[1] "push_uw_practice_score_step1_1" "push_uw_practice_score_step1_2"
[1] "final_nbme_practice_score_step1_1" "final_nbme_practice_score_step1_2"
[1] "final_uw_practice_score_step1_1" "final_uw_practice_score_step1_2"
[1] "score_step1_1" "score_step1_2"
[1] "other_resources_step1_1" "other_resources_step1_2"
[1] "other_step1_1" "other_step1_2"
[1] "changes_step1_1" "changes_step1_2"
[1] "Above is a list of columns I have combined for you. Hope it looks right!"
[1] "uworld_percent_step2_1" "uworld_percent_step2_2"
[1] "amboss_percent_step2_1" "amboss_percent_step2_2"
[1] "length_step2_1" "length_step2_2"
[1] "practicetest_step2_1" "practicetest_step2_2"
[1] "full_test_practice_step2_1" "full_test_practice_step2_2"
[1] "practice_score_step2_1" "practice_score_step2_2"
[1] "practice_test_step2_1" "practice_test_step2_2"
[1] "score_step2_1" "score_step2_2"
[1] "target_score_step2_1" "target_score_step2_2"
[1] "Above is a list of columns I have combined for you. Hope it looks right!"

There were 203 participants after the the clients did data exclusion by hand, as compared to the total survey size which was 182 unique individuals included in the step 1 data.

These are the record IDs that were excluded by the collaborator: VUSM 23, VUSM 60, HMS 37, HMS 41, HMS 49, UVASOM 33, UVASOM 80, UVASOM 84.

For our analysis, I originally included everyone from the 3 school specific data sets. Then, we flagged individuals to remove and I remade the source data set. These are the records IDs that we decided to exclude:

  • VU record ids: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64

  • HMS record ids: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61

  • UVA record ids: 33, 47, 80, 81, 83

Below are tables for Step 1 and Step 2 response inclusion by school:

Code
kable(table(step1_complete$schoolid), caption = "Number of Step 1 responses included by school")
Number of Step 1 responses included by school
Var1 Freq
HMS 50
UVa 79
VU 53
Code
kable(table(step2_complete$schoolid), caption = "Number of Step 2 responses included by school")
Number of Step 2 responses included by school
Var1 Freq
HMS 39
UVa 65
VU 34

Notice that unfortunately for Step 2 analysis, we had to drop 44 individuals who did not report a step 2 score. These raw counts and frequencies of people who did not give a step 2 score (and are therefor not eligible for analysis) are listed by institution below.

Code
# describing the missingness
kable(table(drop_these_with_no_outcome$schoolid), caption = "Number of missing Step 2 scores by institution")
Number of missing Step 2 scores by institution
Var1 Freq
HMS 11
UVa 14
VU 19
Code
num <- as.vector(table(drop_these_with_no_outcome$schoolid))
denom <- as.vector(table(step2_complete$schoolid))
nonresponse_freq <- setNames(c(round(num/denom, 3)), c(names(table(step2_complete$schoolid))))
kable(nonresponse_freq, caption = "frequency of missing step 2 scores by instition")
frequency of missing step 2 scores by instition
x
HMS 0.282
UVa 0.215
VU 0.559

We also find it helpful to visually profile the missingness in this graphic below. Some of this missingness is due to questions being hierarchical.

Code
######### visually profile missing responses
p1 <- visdat::vis_miss(step1_complete)
p2 <- visdat::vis_miss(step2_complete)

title_gg <- ggdraw() + draw_label("Response missingness for pooled survey results across exam order")
gridded <- plot_grid(p1, p2, label_size = 12, ncol = 2, align = "hv", label_x = 0.5,
          labels = c("Step 1","Step 2"))
plot_grid(title_gg, gridded, nrow = 2, rel_heights = c(0.1, 0.9))

2 Analysis

Note, we plan to use the robust/sandwich variance estimator for regression models. One inclusion criteria is that the outcome variable (“Y”) must be available for a subject to be included in the analysis question (ie. if they did not report a step 2 score, we won’t perform relevant step 2 analysis on them).

::: panel-tabset ## Q1 Does order have an impact on Step 2 scores?

We cannot analyze Step 1 since all survey responses reported passing For Step 2 scores, I will perform a linear regression with:

  • Y = Step 2 score
  • X = factor ( 1 = “step 1 first”, 2 = “step 2 first”, 3 = “only step 1”, 4 = “only step 2”)
  • Z = school (need to adjust for this)
Code
step2_complete[ ,"order_factor"]  <- factor(step2_complete$exam_order, levels = c(1,2, 3, 4), labels = c("step 1 first", "step 2 first", "only step 1", "only step 2"))

step2_complete[ ,"which_exam_first"] <- ifelse(step2_complete$order_factor == "step 1 first" | step2_complete$order_factor == "only step 1", 1, 2)

step2_complete[ ,"school_factor"]   <- factor(step2_complete$schoolid, labels = c("HMS", "UVA", "VUMS"))

table(step2_complete$order_factor)

step 1 first step 2 first  only step 1  only step 2 
          71           67            0            0 
Code
table(step2_complete$which_exam_first)

 1  2 
71 67 
Code
exam_order_mod <- lm(formula = score_step2 ~ which_exam_first + school_factor, data = step2_complete)
sandwich(exam_order_mod) #this gives the sandwich variance
                  (Intercept) which_exam_first school_factorUVA
(Intercept)         15.312108       -11.102169         4.924963
which_exam_first   -11.102169        10.495092        -7.995276
school_factorUVA     4.924963        -7.995276        14.443649
school_factorVUMS   -1.980325        -1.508644         4.717851
                  school_factorVUMS
(Intercept)               -1.980325
which_exam_first          -1.508644
school_factorUVA           4.717851
school_factorVUMS          5.933156
Code
summary(exam_order_mod) #model summary for reporting

Call:
lm(formula = score_step2 ~ which_exam_first + school_factor, 
    data = step2_complete)

Residuals:
     Min       1Q   Median       3Q      Max 
-112.100   -5.178    2.767    8.761   21.900 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       261.3383     5.2033  50.225   <2e-16 ***
which_exam_first    0.8952     4.5256   0.198    0.844    
school_factorUVA   -4.0290     5.0342  -0.800    0.425    
school_factorVUMS  -0.9797     3.5128  -0.279    0.781    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.69 on 134 degrees of freedom
Multiple R-squared:  0.01018,   Adjusted R-squared:  -0.01198 
F-statistic: 0.4593 on 3 and 134 DF,  p-value: 0.7112

Based on the model output (p values), there doesn’t appear to be any signifigant associations between exam order and the score for Step 2. Since the R^2 value is essentially 0, I conclude that there was no effect of exam order on step 2 scores.

2.1 Q2 What factors affect Step 2 score?

Again, we cannot analyze Step 1 scores since all respondents reported passing.

Based on our SAP, if there are any covariates with more than 30% of responses missing, we will drop that variable or populate it with 0, depending on context. For example, the percent of Amboss questions completed will be filled with 0 for people who didn’t answer that they used Amboss since it seems safe to assume they didn’t complete any of the Amboss questions. For people who did select that they used Amboss but didn’t answer a percentage will be treated as missing

If less than 30% are missing, I may consider performing bootstrap sampling of known values to replace missing values.

After accounting for missingness, I will assess for co-linearity of the predictors (ie. correlation) using VIF. If there is high co-linearity, we will use LASSO to perform variable selection. If there is no evidence of concerning levels of colinearity, I will proceed with linear regression.

Code
#here I am creating the cleaned Uworld and Amboss variables. if they did not check that they used it, the percentage is set to 0.
# if they did use it then the next ifelse checks if there is an NA where there should be a percent
# if there is an NA where there should be a percent, we maintain this NA. If there is not, then we can report the percentage.
# same process for uworld
step2_complete <- step2_complete %>% mutate(amboss_clean = ifelse(I.amboss == 0, 0, 
                                    ifelse(is.na(step2_complete$amboss_percent_step2) == TRUE, NA, step2_complete$amboss_percent_step2)),
                                    
                                    uworld_clean = ifelse(I.uworld == 0, 0, 
                                    ifelse(is.na(step2_complete$uworld_percent_step2 ) == TRUE, NA, step2_complete$uworld_percent_step2)))

#inspecting percent missing, it seems like most responses are now complete enough for analysis
step2_complete$amboss_percent_step2 <- ifelse(is.na(step2_complete$amboss_percent_step2) == TRUE, 0, step2_complete$amboss_percent_step2)
class(step2_complete$practicetest_step2) <- "integer"

step2_complete[,"on_target"] <-  factor(step2_complete$target_score_step2, levels = c(1,2,3), labels = c("at target", "above target", "below target"))


# I looked into the difference between practice_test_step2 (the final practice test I took before my exam was... 8 options) and practicetest_step2 (text response of how many practice tests did you take before step 2).
# I decided to use practicetest_step2 (text response of how many practice tests did you take before step 2) but extract and recode it as numeric below
step2_complete[, "practice_test_2_clean"] <- ifelse(is.na(step2_complete$practicetest_step2) == TRUE, NA, substr(step2_complete$practicetest_step2, start = 1, stop = 1))
step2_complete[, "number_of_practice_tests"] <- as.numeric(step2_complete$practice_test_2_clean) #I want to include the number of practice tests as just one numeric covariate, not coded as a factor like it is in practice_test_2_clean

step2_complete$full_practice_test <- factor(step2_complete$full_test_practice_step2 , levels = c(1:4), labels = c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))
step2_complete$simulate_full_practice <- ifelse(step2_complete$full_test_practice_step2 <= 2, 1, 0)
step2_complete$simulate_full_practice <- factor(step2_complete$simulate_full_practice, levels = c(0, 1), labels = c("No", "Yes"))
step2_complete$length_step2 <- factor(step2_complete$length_step2 , levels = c(1:6), labels = c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))


######## profile missingness in the step 2 data and address
step2_profile <- step2_complete %>% relocate(c(score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor), .before = schoolid)

percents_missing <-round(colSums(is.na(step2_profile))/nrow(step2_profile), 3)*100
kable(percents_missing, caption = "Percent missing observations for pooled Step 2 survey")
Percent missing observations for pooled Step 2 survey
x
record_id 0.0
score_step2 0.0
uworld_clean 1.4
amboss_clean 9.4
length_step2 1.4
simulate_full_practice 0.0
practice_score_step2 29.7
number_of_practice_tests 5.1
school_factor 0.0
schoolid 0.0
exam_order 0.0
uworld_percent_step2 1.4
amboss_percent_step2 0.0
practicetest_step2 5.1
full_test_practice_step2 0.0
practice_test_step2 29.0
target_score_step2 0.0
I.uworld 0.0
I.firstaid 0.0
I.anki 0.0
I.sketchy 0.0
I.amboss 0.0
I.pathoma 0.0
I.boardsbeyond 0.0
I.other 0.0
order_factor 0.0
which_exam_first 0.0
on_target 0.0
practice_test_2_clean 5.1
full_practice_test 0.0

Multiple linear regression with:

  • Y = Step 2 score (from “What was your 3-digit score on the official Step 2 exam?”)

  • X1 = % UWorld (recoded from “Approximately what % of UWorld did you complete during your protected study period for step 2?”)

  • X2 = % Amboss (recoded from “Approximately what % of the Amboss question bank did you complete during your Step 2 protected study period?”)

  • X3 = length study (factor from “How long did you study for Step 2 during a protected study period?”)

  • X4 = number of practice tests (recoded from a text answer from “How many total practice tests did you take before Step 2?”)

  • X5 = full test day (yes/no code as binary from “Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?”)

  • X6 = final practice score (What was your 3 digit score on the final practice test you took before the Step 2 exam?)

  • Z = School (need to adjust for this in order to remove any influence)

Careful to note that not all cases are complete - for example there are 399 responses in the complete step 2 dataset, of which for the number of practice tests taken, 108 are missing and 291 have a response recorded.

Below I report the model results, sandwich variance, and VIF for step 2 scores model. After that, I show descriptive statistics for predictors and the outcome that was included in the model.

Code
mod_step2_scores <- lm(score_step2 ~ uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_complete) 
summary(mod_step2_scores)

Call:
lm(formula = score_step2 ~ uworld_clean + amboss_clean + length_step2 + 
    simulate_full_practice + practice_score_step2 + number_of_practice_tests + 
    school_factor, data = step2_complete)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.360  -3.463   1.966   5.068  15.498 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   111.950778  18.543622   6.037 6.22e-08 ***
uworld_clean                    0.003943   0.044241   0.089   0.9292    
amboss_clean                    0.041176   0.042959   0.959   0.3410    
length_step25-6 weeks           1.930819   2.221260   0.869   0.3876    
length_step27-8 weeks          -4.190772   2.771947  -1.512   0.1349    
length_step2more than 8 weeks  -7.817921   6.423664  -1.217   0.2276    
simulate_full_practiceYes       0.410541   2.165465   0.190   0.8502    
practice_score_step2            0.599139   0.071590   8.369 3.14e-12 ***
number_of_practice_tests       -1.156330   0.672732  -1.719   0.0899 .  
school_factorUVA                0.845109   2.534346   0.333   0.7398    
school_factorVUMS              -1.417165   2.459452  -0.576   0.5663    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.204 on 72 degrees of freedom
  (55 observations deleted due to missingness)
Multiple R-squared:  0.5603,    Adjusted R-squared:  0.4992 
F-statistic: 9.174 on 10 and 72 DF,  p-value: 1.401e-09
Code
sandwich(mod_step2_scores)
                               (Intercept)  uworld_clean  amboss_clean
(Intercept)                   340.41118113 -0.0783352214  0.1696981274
uworld_clean                   -0.07833522  0.0011037779  0.0001622475
amboss_clean                    0.16969813  0.0001622475  0.0010885950
length_step25-6 weeks          -1.73083732  0.0036043139 -0.0021614635
length_step27-8 weeks          -2.77583465  0.0234623553  0.0010957333
length_step2more than 8 weeks   0.13362159 -0.0150723533  0.0350055231
simulate_full_practiceYes       7.36295833 -0.0171304514 -0.0007170592
practice_score_step2           -1.21531124 -0.0002017563 -0.0006727492
number_of_practice_tests       -1.93388529  0.0058264396 -0.0034664273
school_factorUVA              -21.61708719  0.0270290610 -0.0257822808
school_factorVUMS             -14.59580072  0.0152415741  0.0048932951
                              length_step25-6 weeks length_step27-8 weeks
(Intercept)                            -1.730837323          -2.775834646
uworld_clean                            0.003604314           0.023462355
amboss_clean                           -0.002161464           0.001095733
length_step25-6 weeks                   3.150662734           1.835473954
length_step27-8 weeks                   1.835473954           6.149906734
length_step2more than 8 weeks           1.290245222           1.274612466
simulate_full_practiceYes               0.410599462          -0.066836678
practice_score_step2                   -0.001350060          -0.005486582
number_of_practice_tests               -0.087789974          -0.063694975
school_factorUVA                        0.626140189           1.314482146
school_factorVUMS                       0.456557913           1.227766263
                              length_step2more than 8 weeks
(Intercept)                                      0.13362159
uworld_clean                                    -0.01507235
amboss_clean                                     0.03500552
length_step25-6 weeks                            1.29024522
length_step27-8 weeks                            1.27461247
length_step2more than 8 weeks                  126.15644668
simulate_full_practiceYes                       -1.83397059
practice_score_step2                             0.01240497
number_of_practice_tests                        -0.66288672
school_factorUVA                                -1.87152051
school_factorVUMS                                0.38067455
                              simulate_full_practiceYes practice_score_step2
(Intercept)                                7.3629583303        -1.2153112380
uworld_clean                              -0.0171304514        -0.0002017563
amboss_clean                              -0.0007170592        -0.0006727492
length_step25-6 weeks                      0.4105994618        -0.0013500601
length_step27-8 weeks                     -0.0668366777        -0.0054865823
length_step2more than 8 weeks             -1.8339705896         0.0124049677
simulate_full_practiceYes                  4.1076618930        -0.0225018643
practice_score_step2                      -0.0225018643         0.0046587629
number_of_practice_tests                  -0.1453023272         0.0009702705
school_factorUVA                          -1.5115492653         0.0604174269
school_factorVUMS                         -0.5325305293         0.0388873822
                              number_of_practice_tests school_factorUVA
(Intercept)                              -1.9338852900     -21.61708719
uworld_clean                              0.0058264396       0.02702906
amboss_clean                             -0.0034664273      -0.02578228
length_step25-6 weeks                    -0.0877899742       0.62614019
length_step27-8 weeks                    -0.0636949754       1.31448215
length_step2more than 8 weeks            -0.6628867246      -1.87152051
simulate_full_practiceYes                -0.1453023272      -1.51154927
practice_score_step2                      0.0009702705       0.06041743
number_of_practice_tests                  0.2744033199       0.20364601
school_factorUVA                          0.2036460086       6.05601479
school_factorVUMS                         0.1089025371       2.88708848
                              school_factorVUMS
(Intercept)                       -14.595800716
uworld_clean                        0.015241574
amboss_clean                        0.004893295
length_step25-6 weeks               0.456557913
length_step27-8 weeks               1.227766263
length_step2more than 8 weeks       0.380674551
simulate_full_practiceYes          -0.532530529
practice_score_step2                0.038887382
number_of_practice_tests            0.108902537
school_factorUVA                    2.887088480
school_factorVUMS                   4.279247727
Code
vif(mod_step2_scores)
                 uworld_clean                  amboss_clean 
                     1.227399                      1.236755 
        length_step25-6 weeks         length_step27-8 weeks 
                     1.503404                      1.609434 
length_step2more than 8 weeks     simulate_full_practiceYes 
                     1.196735                      1.188729 
         practice_score_step2      number_of_practice_tests 
                     1.163156                      1.221483 
             school_factorUVA             school_factorVUMS 
                     1.828295                      1.570226 
Code
mod_step2_scores %>% tbl_regression() #this makes a cleaner looking summary table
Characteristic Beta 95% CI1 p-value
uworld_clean 0.00 -0.08, 0.09 >0.9
amboss_clean 0.04 -0.04, 0.13 0.3
length_step2


    3-4 weeks
    5-6 weeks 1.9 -2.5, 6.4 0.4
    7-8 weeks -4.2 -9.7, 1.3 0.13
    more than 8 weeks -7.8 -21, 5.0 0.2
simulate_full_practice


    No
    Yes 0.41 -3.9, 4.7 0.9
practice_score_step2 0.60 0.46, 0.74 <0.001
number_of_practice_tests -1.2 -2.5, 0.18 0.090
school_factor


    HMS
    UVA 0.85 -4.2, 5.9 0.7
    VUMS -1.4 -6.3, 3.5 0.6
1 CI = Confidence Interval

Trying the model with imputation, but there is an issue right now with just imputing one value instead of randomly sampling…

Note

imputation needs to be fixed before interpreting this model! see the snippet of the updated data showing this issue

Code
#following this thought process for the variables that had missingness for the 2nd model
# df %>% mutate(var = ifelse(!is.na(var),var, sample(var[!is.na(var)],1)))
# as described on this blog https://community.rstudio.com/t/replacing-nas-with-random-values-from-a-set/158833/2 
## model variables were uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor
# with missingness in uworld_clean, amboss_clean, length_step2, practice_score_step2, and number_of_practice_tests

step2_impute <-
  step2_complete %>% select(
    score_step2,
    uworld_clean,
    amboss_clean,
    length_step2,
    simulate_full_practice,
    practice_score_step2,
    number_of_practice_tests,
    school_factor
  ) %>%     mutate(uworld = ifelse(!is.na(uworld_clean), uworld_clean, sample(uworld_clean[!is.na(uworld_clean)], 1)),
                   amboss = ifelse(!is.na(amboss_clean), amboss_clean, sample(amboss_clean[!is.na(amboss_clean)], 1)),
                   length = factor(ifelse(!is.na(length_step2), length_step2, sample(length_step2[!is.na(length_step2)], 1)), levels = c(1:6), labels = c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks")),
                   prac_score = ifelse(!is.na(practice_score_step2), practice_score_step2, sample(practice_score_step2[!is.na(practice_score_step2)], 1)),
                   num_tests = ifelse(!is.na(number_of_practice_tests), number_of_practice_tests, sample(number_of_practice_tests[!is.na(number_of_practice_tests)], 1))  
                   )

colSums(is.na(step2_impute)) #to confirm the random sampling replacement worked
             score_step2             uworld_clean             amboss_clean 
                       0                        2                       13 
            length_step2   simulate_full_practice     practice_score_step2 
                       2                        0                       41 
number_of_practice_tests            school_factor                   uworld 
                       7                        0                        0 
                  amboss                   length               prac_score 
                       0                        0                        0 
               num_tests 
                       0 
Code
#here is a view of one change:
# cbind(step2_impute$length_step2, step2_impute$length)
cbind(step2_impute$practice_score_step2, step2_impute$prac_score)[1:22,]
      [,1] [,2]
 [1,]  266  266
 [2,]  251  251
 [3,]  260  260
 [4,]   NA  242
 [5,]  250  250
 [6,]  269  269
 [7,]  249  249
 [8,]  259  259
 [9,]  230  230
[10,]  237  237
[11,]  258  258
[12,]  265  265
[13,]  256  256
[14,]   NA  242
[15,]   NA  242
[16,]  273  273
[17,]   NA  242
[18,]  268  268
[19,]  274  274
[20,]   NA  242
[21,]  245  245
[22,]  247  247
Code
mod_impute_step2_scores <- lm(score_step2 ~ uworld + amboss + length + simulate_full_practice + prac_score + num_tests + school_factor, data = step2_impute) 
summary(mod_impute_step2_scores)

Call:
lm(formula = score_step2 ~ uworld + amboss + length + simulate_full_practice + 
    prac_score + num_tests + school_factor, data = step2_impute)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.806  -4.551   1.126   5.524  29.251 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               106.61885   16.91198   6.304 4.46e-09 ***
uworld                      0.04551    0.03892   1.169    0.244    
amboss                      0.05191    0.03877   1.339    0.183    
length3-4 weeks            15.45901   10.25516   1.507    0.134    
length5-6 weeks            14.84742   10.29141   1.443    0.152    
length7-8 weeks            10.54125   10.47093   1.007    0.316    
lengthmore than 8 weeks   -13.12656   11.86799  -1.106    0.271    
simulate_full_practiceYes   2.11672    1.86930   1.132    0.260    
prac_score                  0.55386    0.05680   9.751  < 2e-16 ***
num_tests                  -0.63148    0.58549  -1.079    0.283    
school_factorUVA           -0.23933    2.33066  -0.103    0.918    
school_factorVUMS          -2.66193    2.39853  -1.110    0.269    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.663 on 126 degrees of freedom
Multiple R-squared:  0.597, Adjusted R-squared:  0.5618 
F-statistic: 16.97 on 11 and 126 DF,  p-value: < 2.2e-16
Code
sandwich(mod_impute_step2_scores)
                          (Intercept)        uworld        amboss
(Intercept)               518.8404312 -1.126595e-01  1.874483e-01
uworld                     -0.1126595  9.684206e-04 -9.612629e-05
amboss                      0.1874483 -9.612629e-05  8.876163e-04
length3-4 weeks            17.9707738 -7.703520e-02 -5.880233e-03
length5-6 weeks            10.5789352 -7.404632e-02 -1.615869e-02
length7-8 weeks            14.4037086 -5.474271e-02 -1.207826e-02
lengthmore than 8 weeks    -0.1882986 -3.014702e-02  2.488808e-02
simulate_full_practiceYes  -9.3723456 -5.583343e-03 -1.353676e-02
prac_score                 -2.0384644  3.174484e-04 -6.772935e-04
num_tests                   0.8218856  4.687362e-03 -4.611726e-04
school_factorUVA          -17.7240751  1.961091e-02 -7.853974e-03
school_factorVUMS          -3.0624374  1.112401e-02  5.942170e-03
                          length3-4 weeks length5-6 weeks length7-8 weeks
(Intercept)                  17.970773800     10.57893524     14.40370855
uworld                       -0.077035203     -0.07404632     -0.05474271
amboss                       -0.005880233     -0.01615869     -0.01207826
length3-4 weeks               9.677640753      8.58122736      7.29620191
length5-6 weeks               8.581227355     10.13157896      7.84391035
length7-8 weeks               7.296201908      7.84391035     11.82362844
lengthmore than 8 weeks       0.088176199     -0.14098643     -2.66023850
simulate_full_practiceYes     2.483701761      2.84367975      2.68039706
prac_score                   -0.067985548     -0.04071236     -0.06003546
num_tests                    -0.575117578     -0.70332190     -0.69805987
school_factorUVA             -1.660338534     -0.86788814     -0.02340190
school_factorVUMS            -1.405433871     -1.02857769      0.10444256
                          lengthmore than 8 weeks simulate_full_practiceYes
(Intercept)                          -0.188298607              -9.372345623
uworld                               -0.030147019              -0.005583343
amboss                                0.024888078              -0.013536759
length3-4 weeks                       0.088176199               2.483701761
length5-6 weeks                      -0.140986434               2.843679749
length7-8 weeks                      -2.660238500               2.680397057
lengthmore than 8 weeks             173.067318804              -1.163835602
simulate_full_practiceYes            -1.163835602               3.406591120
prac_score                            0.002253545               0.029306137
num_tests                             0.504661243              -0.118816387
school_factorUVA                     -0.466031216              -0.725214704
school_factorVUMS                    -0.349550555              -0.449761151
                             prac_score     num_tests school_factorUVA
(Intercept)               -2.038464e+00  0.8218855727    -17.724075069
uworld                     3.174484e-04  0.0046873618      0.019610907
amboss                    -6.772935e-04 -0.0004611726     -0.007853974
length3-4 weeks           -6.798555e-02 -0.5751175775     -1.660338534
length5-6 weeks           -4.071236e-02 -0.7033219048     -0.867888135
length7-8 weeks           -6.003546e-02 -0.6980598706     -0.023401905
lengthmore than 8 weeks    2.253545e-03  0.5046612431     -0.466031216
simulate_full_practiceYes  2.930614e-02 -0.1188163866     -0.725214704
prac_score                 8.138307e-03 -0.0058347429      0.054652601
num_tests                 -5.834743e-03  0.2084142831      0.046196846
school_factorUVA           5.465260e-02  0.0461968462      4.927882528
school_factorVUMS         -4.561237e-05  0.0293534203      3.346184403
                          school_factorVUMS
(Intercept)                   -3.062437e+00
uworld                         1.112401e-02
amboss                         5.942170e-03
length3-4 weeks               -1.405434e+00
length5-6 weeks               -1.028578e+00
length7-8 weeks                1.044426e-01
lengthmore than 8 weeks       -3.495506e-01
simulate_full_practiceYes     -4.497612e-01
prac_score                    -4.561237e-05
num_tests                      2.935342e-02
school_factorUVA               3.346184e+00
school_factorVUMS              5.066379e+00
Code
vif(mod_impute_step2_scores)
                   uworld                    amboss           length3-4 weeks 
                 1.217238                  1.094273                 36.765374 
          length5-6 weeks           length7-8 weeks   lengthmore than 8 weeks 
                37.946253                 23.278015                  4.426581 
simulate_full_practiceYes                prac_score                 num_tests 
                 1.134787                  1.217212                  1.205791 
         school_factorUVA         school_factorVUMS 
                 2.000107                  1.578556 
Code
mod_impute_step2_scores %>% tbl_regression() #this makes a cleaner looking summary table
Characteristic Beta 95% CI1 p-value
uworld 0.05 -0.03, 0.12 0.2
amboss 0.05 -0.02, 0.13 0.2
length


    1-2 weeks
    3-4 weeks 15 -4.8, 36 0.13
    5-6 weeks 15 -5.5, 35 0.2
    7-8 weeks 11 -10, 31 0.3
    more than 8 weeks -13 -37, 10 0.3
simulate_full_practice


    No
    Yes 2.1 -1.6, 5.8 0.3
prac_score 0.55 0.44, 0.67 <0.001
num_tests -0.63 -1.8, 0.53 0.3
school_factor


    HMS
    UVA -0.24 -4.9, 4.4 >0.9
    VUMS -2.7 -7.4, 2.1 0.3
1 CI = Confidence Interval

2.2 Q3 What is associated (in this data) with pushing back a Step 1 exam date?

Here, I will perform logistic regression with

Y = yes or no (1 = yes, 2 = no for “push_step1”)

There may not be sufficient data on this since only 20 people responded that they decided to push back Step 1. The factors that were measured are:

  • push remember step1 (1 = I only remember the form name, 2 = I only remember the score, 3 = I remember the form name and the score, 4 = I don’t remember either) - we decided not to include this variable (can change later if desired)

  • push score only step 1 (1 = NBME, 2 = Uworld)

  • push practice test step 1 ( 1 - 8 listing various exams)

  • push nbme practice score (from 0 to 100%)

  • push uw practice score (from 180 to 300)

Listing variables by name and if I have included them:

  • “push_step1_1” yes

  • “push_remember_step1_1” not included b/c a precursor question

  • “push_score_only_step1_1” not currently included but could be

  • “push_practice_test_step1_1” yes

  • “push_nbme_practice_score_step1_1” yes

  • “push_uw_practice_score_step1_1” yes

Code
step1_complete$push_step1 <- ifelse(step1_complete$push_step1 == 2 | is.na(step1_complete$push_step1) ==TRUE, 2, 1) #recording the NA's to be "No" (they did not push back step 1)
step1_complete$push_step1_label <- factor(step1_complete$push_step1, levels = c(1, 2), labels = c("Yes", "No")) #making a nice descriptive label

#did_push_df <- step1_complete %>% filter(push_step1_label == "Yes")
#dat %>% filter(!is.na(push_remember_step1_1))

step1_complete$push_practice_test_step1 <- factor(step1_complete$push_practice_test_step1, levels = c(1:8), labels = c("NBME 25", "NBME 26", "NBME 27", "NBME 28", "NBME 29", "NBME 30", "UWorld 1", "UWorld 2"))

#making labels for descriptive table
units(step1_complete$push_uw_practice_score_step1) <- "score units"
units(step1_complete$push_nbme_practice_score_step1) <- "percent"
label(step1_complete$push_practice_test_step1) <- "exam that triggered pushing back"
label(step1_complete$push_nbme_practice_score_step1) <- "NBME P(passing) that triggered pushing back"
label(step1_complete$push_uw_practice_score_step1) <- "3 digit UWorld score that triggered pushing back"
caption <- "Description of indiviuals that pushed back Step 1"
table1(~ push_practice_test_step1 + push_uw_practice_score_step1 + push_nbme_practice_score_step1 |push_step1_label, data=step1_complete, topclass="Rtable1-zebra")
Yes
(N=20)
No
(N=162)
Overall
(N=182)
exam that triggered pushing back
NBME 25 0 (0%) 0 (0%) 0 (0%)
NBME 26 0 (0%) 0 (0%) 0 (0%)
NBME 27 1 (5.0%) 0 (0%) 1 (0.5%)
NBME 28 1 (5.0%) 0 (0%) 1 (0.5%)
NBME 29 0 (0%) 0 (0%) 0 (0%)
NBME 30 0 (0%) 0 (0%) 0 (0%)
UWorld 1 0 (0%) 0 (0%) 0 (0%)
UWorld 2 1 (5.0%) 0 (0%) 1 (0.5%)
Missing 17 (85.0%) 162 (100%) 179 (98.4%)
3 digit UWorld score that triggered pushing back (score units)
Mean (SD) 180 (NA) NA (NA) 180 (NA)
Median [Min, Max] 180 [180, 180] NA [NA, NA] 180 [180, 180]
Missing 19 (95.0%) 162 (100%) 181 (99.5%)
NBME P(passing) that triggered pushing back (percent)
Mean (SD) 65.6 (18.3) NA (NA) 65.6 (18.3)
Median [Min, Max] 67.5 [30.0, 87.0] NA [NA, NA] 67.5 [30.0, 87.0]
Missing 12 (60.0%) 162 (100%) 174 (95.6%)

2.3 Descriptive Statistics (now just needs to look uniform)

Descriptive statistics will be reported in total and not by school (as requested by the collaborators).

  • Question banks

    • % UWorld for step 1 and 2
    • % Amboss for step 1 and 2
  • length of study for step 1 and step 2 (barplot with frequency table)

  • number of practice tests (recoded from a text answer from “How many total practice tests did you take before Step 2?”)

  • full test day as factor from “Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?” “full_practice_test”

  • histogram of Step 2 scores

  • frequency table of Step 1 scores

  • what resources are most widely used barplot

  • number of practice tests histogram (among the people who answered the question) for step 1

  • summarize comments for other_resources, other_step, changes_step

Practice Scores for Step 1 and Step 2 - probability of passing for step 1 - final_nbme_practice_score_step1 is a percent from 0 to 100 - final_uw_practice_score_step1 is a 3 digit UWorld score for step 1 - final practice score for step 2 (3 digit score)

looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.

Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.

These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.

Code
# Here is how I am planning to organize it:
# 1   Question banks
#       -   % UWorld for step 1 and 2
#       -   % Amboss for step 1 and 2
# 2   length of study for step 1 and step 2 (barplot with frequency table)
# 3   # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")
# 4   number of practice tests histogram (among the people who answered the question) for step 1
# 5   full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"
# 6   histogram of Step 2 scores
# 7   frequency table of Step 1 scores
# 8   what resources are most widely used barplot
# 9   practice score step 2
# 10   summarize comments for other_resources, other_step, changes_step
# 11 Practice Scores for Step 1 and Step 2
#  11a probability of passing for step 1
#   11b  final_nbme_practice_score_step1 is a percent from 0 to 100
#   11c   final_uw_practice_score_step1 is a 3 digit UWorld score for step 1
# 11d   final practice score for step 2 (3 digit score)
####  and these are covered in other plots already (as numbered)
# 11b looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.
# 7 Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.
# 9 These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.

############################ now see the plots below as organized by the list I made above
# 1   Question banks
#       -   % UWorld for step 1 and 2
#       -   % Amboss for step 1 and 2
qudf <- data.frame(amount = c(step1_complete$uworld_percent_step1, 
                             step1_complete$amboss_percent_step1, 
                             step2_complete$uworld_percent_step2, 
                             step2_complete$amboss_percent_step2), 
                  step_exam = c(rep("Step 1", times = nrow(step1_complete)*2), 
                                rep("Step 2", times = nrow(step2_complete)*2)), 
                  resource_name = c(rep("UWorld", times = nrow(step1_complete)), 
                                    rep("Amboss", times = nrow(step1_complete)), 
                                    rep("Uworld", times = nrow(step2_complete)), 
                                    rep("Amboss", times = nrow(step2_complete))))

ggplot(qudf, aes(amount)) + geom_histogram() + facet_wrap(~step_exam + resource_name) + theme_minimal() + scale_fill_brewer(palette="Blues")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 209 rows containing non-finite values (`stat_bin()`).

Code
# 2   length of study for step 1 and step 2 (barplot with frequency table)
## how long did the students study barplot
       # for step 1
step1_complete$length_step1 <- factor(step1_complete$length_step1 , levels = c(1:6), labels = c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))
ggplot(data.frame(step1_complete), aes(x=length_step1)) + geom_bar() + theme_minimal() + xlab("Time") + ylab("Frequency") + labs(title = "How long did you study for Step 1 during a protected study period?")

Code
       # for step 2
ggplot(data.frame(step2_complete), aes(x=length_step2)) + geom_bar() + theme_minimal() + xlab("Time") + ylab("Frequency") + labs(title = "How long did you study for Step 2 during a protected study period?")

Code
# 3   # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")
ggplot(aes(x = number_of_practice_tests), data= step2_complete) + geom_histogram() + theme_minimal() + xlab("Number of tests") + ylab("Frequency") + labs(title = "How many total practice tests did you take before Step 2?")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 7 rows containing non-finite values (`stat_bin()`).

Code
# 4   number of practice tests histogram (among the people who answered the question) for step 1
ggplot(aes(x = practicetest_step1), data= step1_complete) + geom_histogram() + theme_minimal() + xlab("Number of tests") + ylab("Frequency") + labs(title = "How many total practice tests did you take before Step 1?") ## number of practice tests histogram (among the people who answered the question)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 29 rows containing non-finite values (`stat_bin()`).

Code
# 5   full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"
ggplot(aes(x = full_practice_test), data= step2_complete) + geom_bar() + theme_minimal() + xlab("Response") + ylab("Frequency") + labs(title = "Did you simulate a full Step 2 test day experience, \n e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?")

Code
# 6   histogram of Step 2 scores
ggplot(aes(x = score_step2), data= step2_complete) + geom_histogram() + theme_minimal() + xlab("Self Reported Step 2 Score") + ylab("Frequency") + labs(title = "Frequency of reported Step 2 Scores")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Code
# 7   frequency table of Step 1 scores
kable(table(step1_complete$score_step1), caption = "Frequency of passing Step 1") # would be nice to have a way to summarize these in the same table, but this can just be remade with a table in Word
Frequency of passing Step 1
Var1 Freq
1 154
Code
kable(table(is.na(step1_complete$score_step1)), caption = "True indicates missing values for Step 1 score")
True indicates missing values for Step 1 score
Var1 Freq
FALSE 154
TRUE 28
Code
# 8   what resources are most widely used barplot 
resources_2 <- step2_complete %>% select(I.uworld:I.other) %>% summarise(across(everything(), ~ sum(.)))
resources_1 <- step1_complete %>% select(I.uworld:I.other) %>% summarise(across(everything(), ~ sum(.)))
rdf <- data.frame(amount = c(t(resources_1), t(resources_2)), step_exam = c(rep("Step 1", 8), rep("Step 2", 8)), 
                  resource_name = rep(names(resources_1), times = 2))
# http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization used this as a guide
#what resources are most widely used? I want to sort this barplot in order of frequency but for some reason this wasn't working for me with reorder()
ggplot(data=rdf, aes(x=resource_name, y=amount, fill = step_exam)) +
  geom_bar(stat="identity", position=position_dodge())+
  theme_minimal() + coord_flip() + scale_fill_brewer(palette="Blues")

Code
# 9   practice score step 2
ggplot(aes(x = practice_score_step2), data= step2_complete) + geom_histogram() + scale_fill_brewer(palette="Blues") + theme_minimal() + xlab("Self Reported Step 2 Practice Score") + ylab("Frequency") + labs(title = "Reported Step 2 Practice Scores")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 41 rows containing non-finite values (`stat_bin()`).

Code
# 10   summarize comments for other_resources, other_step, changes_step
    ## summarize comments (other_resources, other_step, changes_step)
    #"other_resources_step1"           "other_step1"                     "changes_step1"   
    kable(table(step1_complete$other_resources_step1), caption = "Other listed resources for Step 1")
Other listed resources for Step 1
Var1 Freq
121
Dirty Medicine 1
Dirty Medicine YouTube videos 1
NBME and Dr. Legallo's videos and notes from preclinical 1
nbme practice exams 1
NBME practice tests 1
Online MedEd 1
Pixorize 1
Youtube 1
Code
    #If you used other resources not listed above, would you use them again?
    step1_complete$other_hascontents <- as.integer(nchar(step1_complete$other_resources_step1) > 0) #flagging the rows that have contents https://stackoverflow.com/questions/64744988/testing-to-see-if-characters-are-present-in-a-cell-in-r
    as_tibble(step1_complete[which(step1_complete$other_hascontents == 1), c("other_resources_step1", "other_step1") ])
# A tibble: 8 × 2
  other_resources_step1                                    other_step1          
  <chr>                                                    <chr>                
1 Pixorize                                                 Yes, helped immensel…
2 Dirty Medicine YouTube videos                            yes                  
3 Dirty Medicine                                           Yes                  
4 nbme practice exams                                      yes                  
5 NBME practice tests                                      Yes                  
6 Youtube                                                  Yes                  
7 Online MedEd                                             Maybe                
8 NBME and Dr. Legallo's videos and notes from preclinical yes, NBME and Dr. Le…
Code
    # what would you change?
    step1_complete$change_hascontents <- as.integer(nchar(step1_complete$changes_step1) > 0) 
    as.data.frame(step1_complete[which(step1_complete$change_hascontents == 1), "changes_step1" ])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         step1_complete[which(step1_complete$change_hascontents == 1), "changes_step1"]
1                                                                                                                                                                                                                                                                                                                              I would ditch pathoma and just do Anki, UWorld, sketchy and pixorize just for Biochem and immunology. I would focus my UWorld on things our preclinical curriculum does not cover well like anatomy, pharmacology, microbiology, etc. I would focus less on clinical info since we were coming off a year of clerkships.
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I did it in 3 weeks and it was VERY stressful. However, this ended up being more than enough time. I would feel reassured that again it is PASS/FAIL so it will be OK!! 
3                                                                                                                                                                                                                                                 Would have started earlier in preclinical - would have ignored my preclinical classes and used more outside resources to build foundation. It was difficult to get up to speed for Step 1, and I regret not studying more in preclinical as I think my Step 2 score would have been significantly higher. I feel that what was not discussed was how much Step 1 content is still relevant to Step 2.
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             I would have studied for 4 weeks instead of 6 weeks. However, I took my test after thanksgiving so there were less test dates available. 
5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        More SketchyMicro. Start using outside resources during 1st year. More UWorld.
6                                                                                                                                                                                                                                                                                                                                                                                                 I could have studied less (like 3-4 weeks) because I was consistently scoring well enough to pass on my practice tests but I was too nervous to take any chances and decided to instead just use the entire 5 weeks I had already decided to allocate
7                                                                                                                                                                                                                                                                                                                                                                                                                                                                     I would make all of my study time part time (like I did when at home for the first 4 weeks). I wouldn't let classmates studying 24/7 (unnecessarily in my opinion) stress me out.
8                                                                                                                                                                                                                                           I took about 3 weeks to review Sketchy Micro/Pharm and Pathoma as well as do some Uworld blocks and practice tests. After scoring so highly on my practice test, I actually stopped studying except for finishing Sketchy. If I could go back, I would just incorporate more of Sketchy, Pathoma, and Uworld into pre-dedicated studying and then just take Step 1 with perhaps 1-2 weeks max of dedicated.
9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Focus more on the things that overlap with step 2
10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     I just used sketchy/anki for micro, and that was great. I don't love it for everything, but I think I might have also tried that combo for some of the biochem. 
11                                                                                                                                                                                                                                                                                                                                                                                         I would have taken the exam early. I think the school's guidance of waiting until you've gotten 99% chance of passing on 2 practice tests was probably a bit excessive but without more information I didn't want to risk not studying as much as possible. 
12                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  would have done more sketchy, moved through pathoma faster (knew most of it already from shelf exams/clinical year)
13                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       I would keep a more strict study schedule in the beginning of my study period.
14                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Taking a longer period of time for dedicated. Not done anything else but study during those months 
15                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Not do Sketchy path 
16                      In hindsight, I realize that a more effective approach would have been to dedicate approximately six weeks to my study plan. During this time, I would have worked through the entirety of UWorld, Sketchy Micro and Pharm, as well as Pathoma. Additionally, I would have made it a point to take every available UWorld and NBME practice test, ensuring that I simulated strict testing conditions.    I should note that my ability to complete only about 25% of UWorld was somewhat facilitated by my solid foundation from my first year of medical school and the experience gained from my shelf exams during the PCE.
17                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spent more time studying during pre-clinical years 
18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Use less time. 4-5 weeks is good enough.
19                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I would have taken Step 1 prior to 3rd year clerkships.
20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   I wish it was after pre-clinical year and not after clinical year.
21                                                                                                                                                                                                                                                                                                                                                                                                                                             Since I took step 1 after clerkships, I did not have to study much. I used it as a prep time for basic sciences of step 2. I would not have bought World for step 1 as I would have been fine without it
22                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I wish we had completed step 1 post pre clerkship with dedicated time. 
23                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   I would focus on pathophysiology section of uworld
24                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    It's really not that hard to pass. Just knowing First Aid and going through UWorld is sufficient.
25                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I would have started reviewing material (i.e. watching B&B, Sketchy pharm) before dedicated started so that I could devote my dedicated time almost exclusively to UWorld questions
26                                                                                                                                                                                                                                                                                                                                                                                                                                                 Though I probably could have studied much less, I did not find the "% likelihood of passing" metric very reassuring and still ended up studying a lot of make sure it was as reassuring as possible.
27                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 I would probably study really hard for 2 weeks and take it at the end of the 2 weeks
28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            I would probably take the test after 2 weeks instead of 3
29                                                                                                               I would have taken it immediately after ending clerkships and then gone off to enjoy my summer. I think many students would be more than capable of taking the exam following first year before the start of clerkships, and in many ways I think the increased proximity to many of the basic science concepts tested on Step 1 would have been a major advantage. Some of the studying felt like taking a step backwards instead of progressing to try to advance the depth of my clinical knowledge going into the immersion phase.
30                                                                                                                                                                                                                                                                                                                                                                                                                                I ended up taking 12 weeks. M1 was terrible for me and I didn't know how to study. So, I would go back and study more for it during M1 or M2 but even 12 weeks was pushing it for me to make up the deficit I was in 
31 I took a practice test on Day 1 of studying, which I did not find helpful. I did poorly on Day 1 and my score improved by 50 points over 2.5 weeks, so the Day 1 baseline was useless.     I would have studied for slightly longer to feel more confident and take one more practice exam. I took 1 practice exam pre-studying and 2 after studying. I passed 1 exam 8 days prior to the test. My study period was 23 days, with days off during that, and I think near 28  days would have been ideal to feel less rushed and taken an additional exam. I do not have a great memory, so I felt like I needed more time, but everyone's different.
32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Much more uworld, less sketchy and anki
33                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 None
34                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               If anything, maybe one less week. It can definitely be done in 4 weeks but the extra time gave me a comfortable buffer
35                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       I would have a shorter study period and focused more on step 2
36                                                                                                                                                                                                                                                                                                                                                                                                 I would have switched to timed, untutored mode on U world sooner. On the real exam, my confidence was so shaky because I was used to the immediate gratification of tutor mode and being able to take my time to think through the harder questions.
37 I wish I would have started the premade anki deck during M1 instead of at the start of M2. I finished about 50% of STEP 1 cards and 75% of STEP 2 cards by the time I took Step 1 and I think it would have been much quicker to a passing score if i had a higher percentage of STEP 1 cards mastered. I did not use UWorld at all because I found that reviewing missed questions didn't help me at all. I never remembered the info or explanations unless I memorized it with anki first. Once I had things memorized, I had no issues applying the knowledge in exam form. The only questions I missed were ones that I had never seen on anki.
38                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Just do UWorld and know First Aid well, that's all you really need.
39                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               None. 
Code
    #dat$study_amount_step1_1

# 11 Practice Scores for Step 1 and Step 2
#   11a  final_nbme_practice_score_step1 is a percent from 0 to 100
#   11b  final_uw_practice_score_step1 is a 3 digit UWorld score for step 1
# 11c   final practice score for step 2 (3 digit score)
    
hist(step1_complete$final_nbme_practice_score_step1, main = "What was the probability of passing you received \n on your final NBME form before the Step 1 exam?", breaks = 40)

Code
kable(table(!is.na(step1_complete$final_nbme_practice_score_step1)), caption = "More responses are missing for P(passing step 1) from NBME") #so 118 are missing and 64 are there
More responses are missing for P(passing step 1) from NBME
Var1 Freq
FALSE 118
TRUE 64
Code
hist(step1_complete$final_uw_practice_score_step1, main = "What was the 3 digit score you received on your final UWorld form before the Step 1 exam?", breaks = 40)

Code
kable(table(is.na(step1_complete$final_uw_practice_score_step1)), caption = "Most responses are missing for UWorld Step 1 final score")
Most responses are missing for UWorld Step 1 final score
Var1 Freq
FALSE 16
TRUE 166
Code
hist(step2_complete$practice_score_step2, main = "What was your 3 digit score on the final practice test \n you took before the Step 2 exam?", breaks = 40)

Code
kable(table(is.na(step2_complete$practice_score_step2)), caption = "there are 41 missing final step2 practice scores")
there are 41 missing final step2 practice scores
Var1 Freq
FALSE 97
TRUE 41

3 Appendix/notes

All the analyses are performed using the following: - R version 4.2.2 (2022-06-24); R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. - Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, https://CRAN.R-project.org/package=rms. The table below lists packages used in this document.

Code
subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion))
                  package loadedversion
cowplot           cowplot         1.1.1
dplyr               dplyr         1.1.3
forcats           forcats         1.0.0
GGally             GGally         2.1.2
ggplot2           ggplot2         3.4.4
gridExtra       gridExtra           2.3
gtsummary       gtsummary         1.7.2
Hmisc               Hmisc         5.1-1
knitr               knitr          1.44
lubridate       lubridate         1.9.3
modelsummary modelsummary         1.4.3
plotly             plotly        4.10.2
purrr               purrr         1.0.2
readr               readr         2.1.4
rms                   rms         6.7-1
sandwich         sandwich         3.0-2
scales             scales         1.2.1
stringr           stringr         1.5.0
table1             table1         1.4.3
tibble             tibble         3.2.1
tidyr               tidyr         1.3.0
tidyverse       tidyverse         2.0.0
viridis           viridis         0.6.4
viridisLite   viridisLite         0.4.2
visdat             visdat         0.6.0